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Abstract 

The Fast Multipole Method (FMM) is well known to possess a bottleneck arising 
from decreasing workload on higher levels of the FMM tree [Greengard and Gropp, 
Comp. Math. Appl., 20(7), 1990]. We show that this potential bottleneck can be 
eliminated by overlapping multipole and local expansion computations with direct 
kernel evaluations on the finest level grid. 
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1 Introduction 

In [1] , Greengard and Gropp give the seminal complexity analysis for the paral- 
lel Fast Multipole Method (FMM) [2]. A notable finding is that the algorithm 
contains a bottleneck which scales as log P, where P is the number of proces- 
sors. This kind of bottleneck is found in many other hierarchical algorithms, 
such as multigrid [3], and results directly from a lack of concurrency. Work 
on a given grid level depends upon results from a finer grid. Although some 
freedom can be exploited [4], grid levels are typically executed sequentially, 
concurrency arising only from operations on a given level. In their analysis, 
Greengard and Gropp also make this tacit assumption. 

However, even allowing for serialization of grid levels, we are left with potential 
concurrency in FMM that is not accounted for in the complexity model. All 
local interaction calculations are independent of the multipole calculations, 
and also are localized to a given cell and its neighbors. The dependency graph 
for the computational stages in FMM is shown in Fig. 1. Notice that no de- 
pendency exists between the multipole calculation, occuring in stages {4, 5, 
6, 7}, and the direct near-field calculation in stage {9}. 
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Fig. 1. Directed Acyclic Graph representing the computational stages of the FMM 
algorithm. Notice that no dependencies exist between {3} and {2,4,5}, and {6,7,8} 
and {9}. This means that these task groups may be executed simultaneously. The 
numbered tasks correspond to: 1. Tree construction 2. Particle binning 3. Interac- 
tion list calculation 4. Particle to multipole calculation 5. Multipole to multipole 
translation 6. Multipole to local transformation 7. Local to local translation 8. Lo- 
cal to particle evaluation 9. Near domain particle evaluation 10. Summing multipole 
and near field contributions. 

If all neighbor data is made available at the start of the computation, these 
local calculations could be used to offset the dearth of concurrency at coarser 
grid levels. In fact, we will show in Section 4 that for nearly every architecture 
in use today, local interaction calculations are sufficient to completely cover 
the bottleneck in the case of our test case from vortex fluid dynamics. 



2 Test problem 



In order to provide concrete values for all coefficients in the complexity es- 
timates, we will focus on a particular test problem, the Euler equations of 
fluid dynamics in two dimensions. We will write the equation in the velocity- 
vorticity formulation, where vorticity is defined as the curl of the velocity, 
w = Vxm [5]. The equations are discretized using the vortex particle method 
over a set of moving nodes located at follows: 

N 

u(x,t) ww ff (ar,t) = J2 7iCr(x, x^t)) (1) 
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where a common choice for the basis function ( a (x,y) is a normalized Gaus- 
sian, 

1 / — \x — y\ 2 \ 

Cr(g.y) = ^exp . (2) 



This formulation was studied previously in [6] for accuracy, and a high per- 
formance implementation was described in [7]. 

The discretized vorticity field in (1) is used in conjunction with the vortic- 
ity transport equation. For ideal flows in two dimensions, the vorticity is a 
preserved quantity over particle trajectories, 

duo „ Duo , , 

_+„. Vw = w = 0, (3) 

and the moving nodes translate with their local velocity, carrying their vortic- 
ity value to automatically satisfy the transport equation. We then calculate 
velocity from the discretized vorticity using the Biot-Savart law, 

u(x,t) — |(VxG)(i- x')u(x', t)dx' — J K(x — x')u(x', t)dx = (K * uo)(x, t) 

where I = V x G is the Biot-Savart kernel obtained from G, the Green's 
function for the Poisson equation, and * represents convolution. In our 2D 
example, the Biot-Savart law is written explicitly as, 

-1 r (x - x') x u(x',t)k . . 
u(x,t) = — / ^ ^ jk^dx'. (4) 

where x = (x\,X2). However, in two dimensions, we can also conveniently 
represent the vector complex number z = u + iv, for which we will also 

use the notation (u,v). 

When the vorticity is discretized using a radial basis function expansion, one 
can always find the analytic integral for the Biot-Savart velocity, resulting in 
an expression for the velocity at each node which is a sum over all particles. 
Using the Gaussian basis function (2), 

K " (2) =2^ ( -"- 1,) ( 1 " eXP ("^))' <5) 
where \z\ 2 = u 2 + v 2 , and the velocity is given by, 

N 

u a (z,t) = Kv(z - Zj(t)). (6) 
i=i 

Thus, the calculation of the velocity of iV vortex particles is an iV-body prob- 
lem, where the kernel K(z) decays quadratically with distance, which makes 
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it a candidate for acceleration using the FMM. Also note that as \z\ becomes 
large, the kernel K(z) approaches l/|-z| 2 . We take advantage of this fact to 
use the multipole expansions of the 1/|<2| 2 kernel as an approximation, while 
the near-field direct interactions are obtained with the exact kernel IK. It has 
been demonstrated that using the expansions for l/|z| 2 does not impair the 
accuracy as long as the local interaction boxes are not too small [6]. 



3 Complexity Estimates 



Greengard and Gropp begin by assuming a constant number B of particles 
per cell on the finest tree level L. In dimension D, as a function of N = 2 DL B, 
the total number of particles, and P, the number of processes, the runtime T 
is given by 

N N NB 

T = a- + b \og 2D P + C — + d— + e(N, P) (7) 

where a-d are constants explained below, and e subsumes the lower order 
terms. We will not address the lower order terms in e further. For the fol- 
lowing discussion, we assume a two dimensional tree for concreteness. Higher 
dimensional trees will be addressed in the final section. We also assume a flop 
rate r for the machine. In order to calculate the coefficients in (7), we will 
analyze the test problem from vortex fluid dynamics shown in Section 2. 

The first term in (7) subsumes all perfectly parallel work in the FMM, namely 
the initialization of multipole expansions, the evaluation of local expansions 
on the finest tree level, and the final sum of multipole and direct contributions. 
From above, our test problem uses the Biot-Savart kernel in two dimensions, 
and complex numbers to represent positions, and we recall (5) 



We carry out a t term multipole expansion in which the mth coefficient is 
given by 

if r p (x p -f c ) m (8) 

particles 

where T p is the circulation of particle p and x p its position, and x c the cell 
center. For details, please refer to [7]. We may now calculate the constant a 
precisely. We will split the different contributions into pieces so that J2i a i = a - 

The work to initialize all multipole expansions to order t, assuming 6 flops per 
complex multiplication, is given by 
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^init = 2 + ]T]r(3 + (6 + 2)t) 

b V 

= 2 + (8t + 3)N 

where b runs over boxes on the finest level and p runs over particles in a given 
box b. We move the constant term to the low order part e. Thus our total time 
for initialization is 

(8t + 3)N 

^ init = ~ p (9) 

and we have 

ai = <«±3>. (10) 
r 

Since, we calculate the local expansions coefficients using the tree structure, 
we need only produce the local series (x p — x c ) m in order to evaluate the 
contribution at each particle location. Thus we have 

a 2 = . (11) 

r 

and for the final summation 

2 , N 

a 3 = -. (12) 
r 

The perfectly parallel work is therefore 

T^™ (!3) 



The second and third terms represent the development of multipole expansions 
for all tree cells by translation (M2M) and addition of expansions in an upward 
pass through the tree. The work is constant per tree cell, so the total work is 
given by 

L-1 L-1 

W up = 4 Z 4(2 + 2t 2 + At(t - 1)) = Cl 4 '- 

1=2 1=2 

However, when we look at the parallel work, after a certain level, log 4 P, there 
are fewer tree cells than processes and thus some processes become idle, 



T up = Cl J24 l (14) 

1=2 

( L-1 aI log 4 P \ 

= * £ 15+ E 1 (15) 

V=log 4 P r 1=2 ) 

( l 4 L — 4 lo §4 p \ 
= ci(p 4-1 +log 4 P~2 (16) 



= Ci 



N 
3BP 



+ log 4 P-^. (17) 
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We move the constant into the low order terms e, and will add the first term 
into the expression for c. The second term represents the reduction bottleneck 
and could limit the scalability of FMM. In fact, this is stated by [1,8] and 
all works known to the author. However, this conclusion ignores potential 
concurrency in the algorithm which will be addressed in Section 4. 

In order to quantify this potential concurrency, we will have to consider the 
fourth term, which describes the direct interaction between particles in the 
same and neighboring tree cells. We will compute each particle pair twice, 
once from each end, to eliminate memory contention issues at the cost of 
additional flops. The work done depends on the number of neighboring boxes, 
so we have three terms 



iy direct = 22 [A(AB 2 -B) + (2 L+2 - 8)(65 2 - B) + (4 L - 2 i+2 + 4)(95 2 - B) 
= 22 (94 L 5 2 - 3(2 L+2 - 8)B 2 - 20B 2 - A L B) 




22 | Q^B 2 - 12\I^B' Z + AB Z — N 

where, in the first line, the first term counts flops for the four corner boxes, the 
second the boxes along the four sides, and the third count flops in the interior 
boxes. Here we use 9 flops for complex division and 1 flop for the exponential. 
The second and third terms of the last line can be moved to the lower order 
e, while the fourth term can be used to correct the perfectly parallel term 

22 

a A = . (18) 

r 

The first term then gives the dominant contribution to the time 

T^™ N 4, (19) 
r P 

so that d = 198/r. 

The third term involves three parts: a transformation of the multipole ex- 
pansion to a local expansion (M2L) in each cell, a reduction of the new local 
expansions for each cell, and a translation of the full local expansion to child 
cells. The M2L transformation and reduction does work 



W M 2L = E 4 ' 27 ( 2 + 2t2 + 15t2 ) 
1=2 

aL+1 _ a2 
,2 , ir.4.2\^ * 



27(2 + 2^ + 15^, 

v ' 4- 1 

9(2 + 2t 2 + 15t 2 ) (A L+1 - 16) 
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Cl 


I fl P.+ — 1 K\ 


b 


i (84 - 48t + 571t 2 ) 


c 


i (128 - 48t + 844t 2 ) 


d 


^198 



Table 1 

Coefficients for the complexity expression in (7) in terms of the flop rate r and the 
expansion order t. 

where we have used 27 as the maximum interaction list size. Since the work 
done by cells with smaller interaction lists will be smaller by a factor of iV -1 / 2 , 
we neglect them here. The L2L translations do work 



W L2L = ^2A l 4(2 + 2t 2 + 8t 2 ) 

1=3 

= ^(2 + 2t 2 + 8t 2 ) (4 L+1 - 64) . 

We can now give the time estimate 

4 26 + 26t 2 + 167t 2 N 



■ down 



3 r BP 

where we have discarded lower order terms, so that we have 

104 + 772t 2 



Co 



3r 



(20) 



(21) 



Thus, for the specific case of our test problem, we can give values for all 
coefficients in the complexity estimate of (7) in terms of the flop rate r, as 
shown in Table 1. 



4 Concurrency 



Our hypothesis is that computation of the local direct interaction among 
neighboring particles can be done concurrently with multipole calculations 
on coarse grid levels in order to maintain full utilization of the machine. In 
order to determine whether local interaction calculations can be used to cover 
a loss of concurrency at coarser grid levels, we must decide how many particles 
will be allocated to each box. Greengard and Gropp determine the optimal 
number of particles per box by minimizing the total time. Finding the mini- 
mum of (7), they obtain 

B opt = t/5 ~ 30 ' ( 22 ) 
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where they have used, in 2D, c = 25s 2 , d = 9, and in single precision s = 15. 



Following this analysis, Gumerov and Duraiswami [8] determine B opt in the 
same manner, but with slightly different constants so that 

« 91. (23) 

However, they use 320 particles in each box for the examples in their paper. 
This will results in higher computational performance, but suboptimal total 
running time. For our test problem, we have 



„ 128 - 48t + 844t 2 

B opt = ^ — -18 (24) 

for t = 15. If this exceeds the minimum B necessary to cover the reduction 
bottleneck, we make no tradeoff in reorganizing our computation for enhanced 
concurrency. 

Using our model from Section 3, we can balance the time in direct evaluation 
with idle time for small grids. Assume that only a single thread, or thread 
group, works on the first L root tree levels, so that L root is the level where we 
start to lose concurrency and can no longer occupy each thread with a separate 
box. The total time for direct evaluation is given by 

,NB , v 

d— , (25) 



and the work on the root tree by 



Thus, we need 



bL root . (26) 



B > -j 1 -^- (27) 
~ d N v ' 

in order to cover the bottleneck completely with direct evaluation calculations. 
Consulting Table 1, we have 



84 - 48t + 571t 2 PL root 

~ 594 N ^ ' 

PL 

> 215.22 ™ ot (29) 

(30) 

where we have used t — 15. It is common to take L root = log 4 P as in the GG 
model, so that 

B > 215.22 31 

N 
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Minimum problem size per process at optimal binning (B = 18) 




10° 10 1 io 2 io 3 io 4 io 5 io 6 

Process count (P) 



Fig. 2. The minimum problem size per process, N/P, for which no bottleneck will 
appear as a function of the number of processes P. 



We see that in order to achieve complete overlap of these computations, mem- 
ory must scale slightly faster than linearly. Although this may ultimately limit 
scalability, even machines an order of magnitude larger than today will not 
test this limit. 

If we take a typical problem with N = 10 6 and P = 10 4 , we see that 

B > 14.30 (32) 

is enough to eliminate the barrier to scaling. Thus, the optimal B is enough to 
guarantee optimal scaling of only a million particles on a modern GPU cluster, 
such as the Tesla, if the computation is reorganized to allow this overlap. 

If we instead demand that N/ P is a fixed constant M, so that we have memory 
scalability, and fix B at its optimal value 18, we have 

M > 12 log 4 P. (33) 

Thus even a very large machine, with more than one million cores, would 
need to assign no more than 120 particles per core. This can be seen clearly in 
Fig. 4, which shows the minimum problem size per process, N/P, for which 
no bottleneck will appear as a function of the number of processes P. 
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5 Conclusions 



We have shown that the potential bottleneck due to decreasing workload on 
higher levels of the FMM tree can be alleviated by overlapping multipole 
computations with direct kernel evaluations on the finest level grid. The im- 
plementation of this overlap on the NVIDIA Tesla architecture will be detailed 
in an upcoming publication. 

This analysis could be extended for more general scenarios, and we posit 
two immediately relevant examples. First, uneven particle distributions will 
produce more direct computation for the same N than the evenly distributed 
case, so that the estimates remain valid and the bottleneck can be avoided. 
If instead the density of particles per box remains constant, the estimates do 
not change and the bottleneck disappears as well. Second, in three or higher 
dimensions, the balance of multipole work to direct computation will change. 
Since the size of the interaction list will increase quickly, it is likely that 
minimum problem size to eliminate the bottleneck will increase, however the 
optimal number of particle per box will also increase. Thus, a more detailed 
analysis will be necessary in this case. 
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